Abstract: Automatic text classification, which is defined as
the process of automatically classifying texts into predefined 
categories, has many applications in our everyday life, and it 
has recently gained much attention due to the increased num- 
ber of text documents available in electronic form. Classify- 
ing News articles is one of the applications of text classifica- 
tion. Automatic classification is a subset of machine learning 
techniques in which a classifier is built by learning from 
some pre-classified documents. Naïve Bayes and k-Nearest 
Neighbor are among the most common algorithms of ma- 
chine learning for text classification. In this paper, we sug- 
gest a way to improve the performance of a text classifier 
using Mutual information and Chi-square feature selection 
algorithms. We have observed that the MI feature selection
method can improve the accuracy of the Naïve Bayes classifier
by up to 10%. The empirical results show that the proposed
model achieves an average accuracy of 80% and an average
F1-measure of 80%.


Keywords: Automatic Persian text classification, K-Nearest 
Neighbor, Naive Bayes, News text classification, Text cate- 
1. Introduction

With the rapid growth of electronic text documents gener- 
ated every day on the Internet, text classification has gained 
more importance in recent years [1]. Text classification, also 
known as text categorization, is the process of assigning 
class labels to a text document according to its content [1]. 
Text classification has been successfully used in domains 
such as topic detection, spam e-mail filtering, news text clas- 
sification, web page classification, author recognition, and 
sentiment analysis. 

Until the beginning of the 21st century, news was not easily
accessible, but today it is readily available on the Internet.
Moreover, in the past only a small group of people, such as
politicians, needed international news, and the news required
by most people was limited to local news. Today, by contrast,
people follow worldwide news and show more interest in it.
Therefore, news text classification is now a challenging field
in text mining. News text classification is defined as
classifying news articles into one or more classes.
Classifying news helps users easily access the news they
want without wasting their time.
Bahareh Pahlevanzadeh²

Mohammad Reza Falahati Qadimi Fumani³


Considering the great number of texts available, manu- 
ally classifying text documents is time-consuming, expen- 
sive, and even impossible; therefore, it is better to use auto- 
matic classification techniques for classifying news articles. 
In this regard, there are two main approaches to classify doc- 
uments automatically: rule-based approach and machine 
learning approach. In the rule-based approach, a set of rules
is written by human experts, and the classification process
is done according to these rules. In machine learning
approaches, a classifier is built by learning from some
pre-classified documents.

One of the most challenging tasks in text classification is 
feature selection [2, 3]. Feature selection is the process in 
which a subset of the most relevant features is selected from 
the feature space [4]. This paper uses two feature selection
methods, Mutual Information (MI) and Chi-square (CHI), to
enhance the performance of the classifier model. A comparison
of these two feature selection methods is also presented and
discussed.

Only a few studies have been conducted on building
classifier models for Persian texts. These studies
mostly use the Hamshahri corpus as their dataset, and
there is a lack of research on any other Persian dataset. Au- 
tomatic Persian text classification based on Persika corpus, a 
collection of Persian news articles collected from ISNA,* has 
been studied once in which MI and Chi-square are applied to 
improve the performance of the classifier algorithm. There- 
fore, the main aim of the present study is to build a classifier 
model for Persian news texts based on Persika dataset using 
KNN and Naive Bayes as the classifier algorithms and the 
MI and chi-square as the feature selection algorithms to see 
how these feature selection methods improve the accuracy of 
the classifier. 

In the second section of the paper, a brief review of the
literature on text classification covers the most common
techniques used by researchers, followed by a more detailed
review of the literature on Persian text classification and,
more specifically, on Persian news text classification. In the
third section, the method used in this study is discussed. In
the fourth section, the evaluation metrics are introduced, and
the model is evaluated using these metrics. In the last
section, the evaluation of the final proposed model for
Persian news classification is discussed. The paper ends with
a conclusion and some avenues for future studies.


1 MSc Student, Department of Computational Linguistics, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran.

2 Corresponding Author. Assistant Professor, Department of Design and System Operations, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran. Email: pahlevanzadeh@ricest.ac.ir.

3 Associate Professor, Department of Computational Linguistics, Regional Information Center for Science and Technology (RICeST), Shiraz, Fars, Iran.


2. Related Work 

2.1 Overview of the State-of-the-art Algorithms of Text 
Classification 

According to Dalal and Zaveri, the history of text
classification goes back to 1961. In the traditional approach
of the 1980s, text classification was done using knowledge
engineering techniques [5], which consisted of manually
defined rules. Because this method is based on logical rules,
it is known as the rule-based approach. Since the logical
rules are written by human experts, building models in this
approach is expensive and time-consuming. Moreover, this
approach is more computationally complicated [6]. Because of
these problems, the machine learning approach has gained
much popularity and has attracted many researchers’
attention since the early 1990s [7, 8]. The machine learning
approach is faster and more straightforward and does not need
a large number of human experts.

During the past decades, many machine learning tech- 
niques of text classification have been introduced and stud- 
ied by researchers of different languages, most of which are 
for the English language. There are different machine learn- 
ing algorithms for text classification, among which the most 
common ones are Naive Bayes (NB) [9], k-Nearest Neighbor 
(KNN) [10], Decision Tree (DT) [11], Support Vector Ma- 
chine (SVM) [12], and Neural Networks (NN) [13]. 

Exhaustive overviews of the state-of-the-art machine learning
algorithms for text classification have been provided by many
authors [5-8]. Therefore, we only give a brief overview of the
approaches used for text classification.


2.1.1 Naive Bayes 
Probabilistic classifiers have attracted much attention in
recent years. Naive Bayes classifiers are the most popular
probabilistic approaches used in text classification in the
literature [2]. Naive Bayes classifiers are a group of
classifiers that use the Bayes rule with the assumption that
the occurrence of each term in a document is independent of
the others. They have been called naive since the early 1990s
because this assumption does not hold in the real world [2].
An experiment on the naive Bayes text classifier was car- 
ried out by McCallum and Nigam (1998) [9]. In their work, 
the authors compared two standard event models of Naive 
Bayes (i.e., multinomial event model and multivariate Ber- 
noulli model). McCallum and Nigam believe that the Naive 
Bayes algorithm is the simplest model among probabilistic 
models. Besides, they maintain that Naive Bayes classifier 
works surprisingly well even though its primary assumption 
about the independence of attributes is not true in the real 
world. Other researchers believe that the performance of Na- 
ive Bayes is very good in comparison with other text classi- 
fication algorithms [14-18]. In more recent research, scien- 
tists have attempted to improve the performance of naive 
Bayes using different methods [19-21]. Many researchers 
have attempted to improve Naive Bayes by applying feature 
selection algorithms on the classifier to reduce the high di- 
mensionality of feature space [22-24]. 


2.1.2 K-Nearest Neighbor (KNN) 

K-nearest neighbor is an example-based non-parametric
classifier; it is one of the simplest and most efficient
classifiers used in text classification. KNN is mostly used in
text classification for its low calculation time and low
computational complexity [25]. KNN classifiers are grouped
under the category of lazy learners. They are called lazy
because they postpone the decision about a test document
until it has been compared with all the training documents.
Yang and Pedersen (1997) [26] were among the first authors to
investigate the KNN classifier, and other researchers have
also shown KNN to be effective [26-33]. In more recent
research, scientists have attempted to improve the
performance of KNN using different methods [34], and some
researchers have used feature selection methods to improve
the performance of the KNN classifier [35, 36].


2.2 Overview of Studies on Text Classification for the Per- 
sian Language 

In a pioneering study [37], a distributed classification of
Persian news articles was proposed using MapReduce as the
programming model and the Hamshahri dataset as the corpus.
The results of this study showed an average recall of 63.75%
and an average precision of 52.67% [37].

In another study, the researchers used the Learning Vec- 
tor Quantization (LVQ) algorithm for classifying Persian 
texts and compared their proposed method with KNN and 
SVM classifiers. They showed that the LVQ model performs
faster than the other algorithms in classifying Persian texts
and reached an average f-measure of 89% [38].

In another model for Persian news text classification,
KNN and SVM classifiers and the TF-IDF feature weighting
approach were used, with the Hamshahri dataset as the corpus.
The authors of this paper showed that KNN would perform
better in classifying Persian texts. They reached an average
f-measure of 94% using their proposed model [29].

In another study, the researchers suggested using a the- 
saurus to improve the SVM classifier for Persian news texts. 
They used a corpus of news articles collected from different 
newspapers and Wikipedia and achieved a micro f-measure 
of 89% [39]. 

In another study [40], KNN classifier for classifying Per- 
sian news texts was proposed using the n-gram model to im- 
prove the efficiency of classifier. The authors compared their 
proposed model with the model in which a thesaurus would 
be used to improve the performance of the SVM classifier. 
In their study, the Hamshahri dataset was used to train the 
classifier; they gained a micro f-measure of 91% [40]. 

In another study [41], the KNN classifier and the Word- 
Net were used to improve the performance of KNN. In addi- 
tion, they applied two feature selection algorithms, IG and 
PCA, and they gained an accuracy of 88.18% using the Ham- 
shahri dataset as the corpus [41]. 

Another study suggested using a PSA feature selection 
algorithm to improve the performance of the classifier for 
Persian text classification [42]. This study used a corpus of
Persian news articles compiled by the authors. The authors
of this paper also compared their proposed feature selection 
algorithm with two other feature selection methods, chi- 
square and correlation coefficient. They have gained an f- 
measure of 87% for their proposed model. 
Table 1. Related Literature on the Persian Language


Paper | Authors | Year | Dataset (corpus) | Measurement parameters
[37] | Esmaeili et al. | 2005 | Hamshahri | recall 63.75%, precision 52.67%
[38] | Pilevar et al. | 2009 | Hamshahri2 | f-measure 89%
[29] | Farhoodi and Yari | 2010 | Hamshahri | f-measure 94%
[39] | Maghsoodi and Homayounpour | 2011 | Sample corpus | f-measure 89%
[40] | Platiminesh et al. | 2012 | Hamshahri | f-measure 91%
[41] | Parchami et al. | 2012 | Hamshahri | accuracy 88.18%
[42] | Bagheri et al. | 2014 | Sample corpus | f-measure 87%
[43] | Ahmadi et al. | 2016 | Bijankhan dataset | accuracy 87%
[44] | Dastgheib and Koleini | 2019 | Scholarly articles from RICeST | f-measure 83%


In another study, the authors investigated applying topic
models for Persian text classification [43]. They used an
SVM classifier for their investigation and reached an
accuracy of 87%. In a more recent study [44], the authors used
the latent semantic indexing (LSI) model instead of the
vector space model, the traditional model of representing
texts for text classification. They used KNN and SVM
classifier algorithms to classify scholarly articles
collected from the RICeST Persian articles repository. They
showed that using the LSI model would improve the
performance of the classifier model and reached an f-measure
of 83% using their proposed model.


3. Materials and Methods
As shown in Figure 1, building a text classifier usually con- 
sists of the following steps: 


[Figure: flow diagram of the steps Dataset Preparation, Data Representation, Text Preprocessing, Feature Selection, Training the Classifier, and Evaluation of the Model]


Figure 1. Building a Classifier Model 


This paper has followed the steps above to build a model 
for classifying Persian news articles. 


3.1 Dataset Preparation

As mentioned earlier, this research is a case study which 
uses the Persika corpus to train a classifier. Persika dataset 
is a corpus of Persian news articles collected from the ISNA 
news website, one of the most reliable and known news 
agencies for Persian news. Persika is the only standard news 
corpus which uses articles from ISNA [45]. Persika contains 
11000 news articles categorized under 11 categories. The 
data in Persika are balanced. That is, each of the 11 classes 
in Persika consists of 1000 news articles. These 11 classes 
concern sports, economy, culture, religion, history,
politics, science, society, education, judiciary, and hygiene. 

A dataset used in a text classification task is divided
into two parts: the training set and the test set [46]. There
are different ways of dividing the dataset into training and
test sets. In this study, we use cross-validation (CV). In
cross-validation, the dataset is divided into k folds. The
model is then built using one fold as the test set and the
remaining k-1 folds as the training set [47]. The process is
repeated k times so that each of the k folds is used once as
the test set [47]. The average error over these k repetitions
is called the cross-validation error, which shows the
performance of the model. In other words, the true positives,
true negatives, false positives, and false negatives are
counted for each fold, and the evaluation measures (i.e.,
accuracy, precision, recall, and f-measure) are then
calculated from them. In this study, 10-fold cross-validation
is adopted.
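
The fold rotation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names and the `evaluate` callback are hypothetical, and the documents are assumed to be already shuffled (no shuffling or stratification is done here).

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists so that each fold is the test set once."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average a per-fold score; `evaluate(train, test)` is a hypothetical callback."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

With 11000 documents and k = 10, each test fold holds 1100 documents and each training set holds the remaining 9900.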


3.2 Data Representation

Data in this study are represented using the Bag-of-Words
(BoW) model, which is the most common way of representing
text documents [48]. In this model, a document is represented
as a vector $V = \{tw_1, tw_2, \ldots, tw_v\}$, where $tw_i$ is
the weight assigned to term $i$ [49]. In the BoW model, the
order of the words is not considered, and only the frequency
of each term is considered.
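
As a small illustration of this representation, the sketch below maps a tokenized document to a vector of raw term counts over a fixed vocabulary. The function name and the English toy vocabulary are assumptions made for the example only; in the study the weights are later replaced by TF-IDF values.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary term in `tokens`, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vocabulary = ["news", "sports", "economy"]  # toy vocabulary, for illustration only
vector = bag_of_words(["sports", "news", "sports"], vocabulary)
```

Because only counts are kept, reordering the tokens of a document leaves its vector unchanged.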


3.3 Data Preprocessing

Since most of the available text documents are in an unstruc- 
tured form, the text on which the training process is based 
should be preprocessed. Preprocessing is an essential part of 
building a classifier model that can positively affect the 
model's accuracy. In this research, the preprocessing stage 
consists of three steps, which are shown in Figure 2. 


[Figure: preprocessing steps Tokenization, Normalization, and Stop Words Removal]


Figure 2. Preprocessing


In the tokenization process, a text document is broken 
into its tokens. In the normalization process, the non-stand- 
ard tokens and structures in a text document are either re- 
moved or standardized. Stop words are frequent words in a 
text document that do not contain important information. 
Removing the stop words reduces the complexity of the 
model and improves the classifier’s performance. 
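
The three preprocessing steps can be sketched as follows. This is a simplified illustration and not the pipeline used in the study: the stop word list is a tiny hypothetical sample, normalization is limited to mapping the Arabic letters Yeh and Kaf to their Persian forms, and the regular-expression tokenizer splits words at zero-width non-joiners, which a real Persian tokenizer would handle more carefully.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real system would use a full one.
STOP_WORDS = {"و", "در", "به", "از"}

# Map Arabic Yeh (U+064A) and Kaf (U+0643) to Persian Yeh (U+06CC) and Keheh (U+06A9).
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def tokenize(text):
    """Break a text into tokens; every run of word characters is one token."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Standardize character variants inside a token."""
    return token.translate(CHAR_MAP)


def preprocess(text):
    """Tokenize, normalize, and drop stop words, in that order."""
    tokens = [normalize(token) for token in tokenize(text)]
    return [token for token in tokens if token not in STOP_WORDS]
```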


3.4 Feature Selection

Text classification usually faces the problem of the high di- 
mensionality of feature space, which is the vast number of 
terms in a text document [50]. Thus, a process is needed to 
reduce the dimensions of feature space by choosing more 
relevant and effective features [51]. This process is called 
dimensionality reduction, also known as feature selection. 
Feature selection reduces the computational complexity of 
the model and therefore improves the classifier performance 
[52]. Feature selection can improve the efficiency and accu- 
racy of a text classifier [54]. It is also beneficial in reducing 
the overfitting (i.e., when a classifier is adjusted to both the 
dependent characteristics of the training data and the consti- 
tutive features of the categories) [5]. 

There are two main types of feature selection algorithms: 
Filter methods and Wrapper methods [53]. Wrapper meth- 
ods use the learning algorithm to evaluate the features. The 
accuracy of the learning algorithm based on a feature reveals 
the effectiveness of that feature. Wrapper methods are more
time-consuming than filter methods because they have to
train a classifier to evaluate each feature; they also work
only for a limited set of classifiers. In contrast to wrapper
methods, filter methods work independently from the learn- 
ing algorithm and are less time-consuming. Filter methods 
measure the importance of each feature using some func- 
tions and then select the most essential features. 

Since filter methods are more straightforward and less
time-consuming than wrapper methods, they are more suitable
for text classification tasks in which there is a large
number of features. Some of the most popular feature
selection algorithms in the filter group are the χ² statistic
(CHI), Information Gain (IG), Mutual Information (MI),
and document frequency (DF).

In this paper, we use two feature selection algorithms, 
namely Mutual Information (MI) and Chi-square (CHI), to
see to what extent they improve the efficiency of the classi- 
fier and compare the performance of these two feature se- 
lection methods. In addition, TF-IDF, which is a term 
weighting method, is applied before MI and CHI. 


3.4.1 Term Frequency-Inverse Document Frequency
(TF-IDF)

TF-IDF is a method of weighting the features of a text
document so that the most relevant features can be selected.
The weight assigned to a term expresses the relevance of that
term to the document. TF-IDF assigns a weight to a term by
considering its term frequency and inverse document frequency
[1]. A term receives a high weight if its frequency in the
document is high, and a low weight if its document frequency
(the number of training documents containing the term) is
high [54]. TF-IDF is calculated using the formula below [55]:


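$$W_{ij} = tf_{ij} \times \log \frac{N}{df_i} \tag{1}$$

where $tf_{ij}$ is the frequency of term $i$ in document $j$, $N$ is the number of training documents, and $df_i$ is the number of training documents containing term $i$.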
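
The weighting can be illustrated with a short Python sketch. It assumes the common form of TF-IDF, raw term frequency multiplied by the logarithm of the number of documents divided by the document frequency; the natural logarithm and the function name are assumptions, since the source does not state the log base.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({term: tf[term] * math.log(n_docs / df[term]) for term in tf})
    return weights


docs = [["a", "b", "a"], ["b", "c"]]
w = tf_idf(docs)
```

In this toy corpus the term "b" occurs in every document, so its weight is zero, while "a" occurs twice in a single document and receives the highest weight there.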


3.4.2 Chi-square (CHI)

CHI is a well-known statistical test that measures the degree
to which a term $t$ and a class $c_i$ are correlated [56].
Chi-square has been reported to outperform other feature
selection algorithms, such as information gain and document
frequency [57]. CHI is calculated using the formula below
[54]:


$$\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{2}$$
where N is the number of all training documents, A is
the number of documents in class c that contain term t, B is
the number of documents that contain term t and are not in
class c, C is the number of documents that are in class c and
do not contain term t, and D is the number of documents that
are not in class c and do not contain term t [54].
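
Equation (2) translates directly into code once the four counts are known. The following sketch is illustrative; the function name is hypothetical, and returning zero when a marginal count is zero is an assumption made to avoid division by zero.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class, from the four document counts of Eq. (2)."""
    n = a + b + c + d  # N: all training documents
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # Assumption: a term or class with an empty margin carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator
```

For example, with A = 40, B = 10, C = 10, and D = 40, the numerator is 100 × 1500² and the denominator is 50⁴, which gives a score of 36. When the term is spread evenly over both classes (all four counts equal), the score is 0.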


3.4.3 Mutual Information (MI)

MI is a measure of the dependency between two variables. In
text classification tasks, these two variables are a term t
and a class c. If the MI between a term $t_k$ and a class
$c_i$ is zero, then $t_k$ and $c_i$ are entirely independent.

MI is defined as follows [58]:


$$MI(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)} \tag{3}$$
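
A minimal sketch of how this score can be estimated from document counts is given below. It assumes the usual count-based estimate, in which P(t, c) is approximated by A/N, P(t) by (A + B)/N, and P(c) by (A + C)/N, with A, B, C, and N defined as for the chi-square statistic. The function name and the handling of A = 0 are assumptions.

```python
import math


def mutual_information(a, b, c, n):
    """Estimate log(P(t, c) / (P(t) P(c))) from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    n: all training documents
    """
    if a == 0:
        # Assumption: the term never co-occurs with the class, so the log diverges.
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))
```

With A = 25, B = 25, C = 25, and N = 100 the term and the class are independent and the score is 0, in line with the statement above.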


3.4.4 Training the Classifier

As mentioned earlier, NB and KNN are among the simplest, 
most effective, and most applicable algorithms. These two 
classifier algorithms have not been used and compared for 
Persian news text classification by applying MI and Chi- 
square feature selection algorithms. Therefore, in this study, 
NB and KNN classifiers are used to build the classifier 
model using MI and chi-square as the feature selection 
methods to see how these feature selection algorithms im- 
prove the efficiency of the model. 


3.4.5 Naive Bayes classifier
As mentioned earlier, the multinomial Naive Bayes (MNB) 
classifier is a probabilistic classifier especially designed for 
text classification. Although in Naive Bayes the main as- 
sumption about the complete independence of the attributes 
is not true in the real world, it performs surprisingly well in 
text classification [59, 60]. The Naive Bayes classifier uses
the Bayes rule, on which all Naive Bayes classifiers are
based [9], to estimate the probability that document d
belongs to class c:


$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)} \tag{4}$$


In text classification, a document is usually represented
as a vector $v = \{v_1, v_2, \ldots, v_k\}$. Given that, for
$i \neq j$, $v_i$ and $v_j$ are conditionally independent
given the class c, we can rewrite Eq. (4) as:


$$P(c \mid d) = P(c) \times \frac{\prod_{j=1}^{k} P(v_j \mid c)}{P(d)} \tag{5}$$
After computing $P(c \mid d)$, we can construct the maximum a
posteriori (MAP) classifier by selecting the category that
maximizes $P(c \mid d)$ using the formulas below [2]:


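$$c_{MAP} = \arg\max_{c \in C} P(c \mid d) \tag{6}$$

$$c_{MAP} = \arg\max_{c \in C} \left\{ P(c) \times \frac{\prod_{j=1}^{k} P(v_j \mid c)}{P(d)} \right\} = \arg\max_{c \in C} \left\{ P(c) \times \prod_{j=1}^{k} P(v_j \mid c) \right\} \tag{7}$$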
There are two Naive Bayes models used for text
classification: the multivariate Bernoulli model and the
multinomial model. In the multivariate Bernoulli model, the
frequency of the terms is ignored because the document is
represented by a vector of binary features indicating the
presence or absence of the words in the text. In this model,
a vocabulary V is given, and a document is represented by a
vector of |V| dimensions [53]. The kth dimension of the
vector corresponds to word $w_k$ from V and is either 1 or 0,
indicating whether word $w_k$ occurs in the document. If
document $d_i$ is represented by a vector
$\{t_1, t_2, \ldots, t_{|V|}\}$, then we can compute
$P(d_i \mid c_j)$ as:


$$P(d_i \mid c_j) = \prod_{k=1}^{|V|} P(w_k \mid c_j)^{t_k} \left(1 - P(w_k \mid c_j)\right)^{1 - t_k} \tag{8}$$


In the Multinomial model, the frequency of the terms is 
considered because in this model a document is represented 
using the bag-of-words model. In this model, the order of 
the words is not considered. The Multinomial model is more 
effective when working with large datasets. Therefore, for 
text classification in which the vocabulary size is large, this 
model would be better than the Multivariate Bernoulli 
model. 

If the frequency of word $w_k$ in document $d_i$ is represented
as $N_{ik}$, then $P(d_i \mid c_j)$ can be computed as follows:


$$P(d_i \mid c_j) = P(|d_i|)\,|d_i|! \prod_{k=1}^{|V|} \frac{P(w_k \mid c_j)^{N_{ik}}}{N_{ik}!} \tag{9}$$


In both models, the probability of class cj, that is P(cj), 
can be computed as: 


$$P(c_j) = \frac{1 + n_j}{|C| + n_{all}} \tag{10}$$
where $n_j$ is the number of documents in class $c_j$ and
$n_{all}$ is the number of all documents in the training set D.
Also, $P(w_k \mid c_j)$ can be computed as follows:


$$P(w_k \mid c_j) = \frac{1 + N_{c_j k}}{N_{all} + n_j} \tag{11}$$

where $n_j$ is the number of words in class $c_j$, $N_{c_j k}$
is the number of occurrences of word $w_k$ in class $c_j$, and
$N_{all}$ is the number of all words in the training set D.
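
To make the multinomial model concrete, the sketch below trains a small classifier and applies the MAP decision rule in log space. It is an illustration under stated assumptions rather than the implementation used in the study: the class name is hypothetical, the word probabilities use the standard add-one smoothing with the vocabulary size in the denominator (which may differ from the denominator of Eq. (11)), and the terms that do not depend on the class, P(d) and the multinomial coefficient, are dropped because they do not change the argmax.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Toy multinomial Naive Bayes over tokenized documents (illustrative only)."""

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocabulary = {term for doc in documents for term in doc}
        self.doc_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)  # term occurrences per class
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc)
        self.n_docs = len(documents)
        return self

    def log_score(self, doc, c):
        # Smoothed class prior.
        score = math.log((1 + self.doc_counts[c]) / (len(self.classes) + self.n_docs))
        total_words = sum(self.word_counts[c].values())
        for term in doc:
            if term not in self.vocabulary:
                continue  # assumption: terms unseen in training are ignored
            # Add-one smoothed word probability (standard form, see lead-in).
            p = (1 + self.word_counts[c][term]) / (len(self.vocabulary) + total_words)
            score += math.log(p)  # repeated terms contribute once per occurrence
        return score

    def predict(self, doc):
        # MAP decision: the class with the highest unnormalized log posterior.
        return max(self.classes, key=lambda c: self.log_score(doc, c))


docs = [["goal", "match", "team"], ["match", "player"],
        ["market", "price"], ["price", "bank", "market"]]
labels = ["sports", "sports", "economy", "economy"]
model = MultinomialNB().fit(docs, labels)
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow for long documents.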


3.4.6 K-Nearest Neighbor (KNN)
As we have mentioned before, KNN is an instance-based 
classifier with high accuracy in text classification. 

The main idea in KNN is to compare the test document
with its neighboring training documents. The similarity
between the test document and each training document is
calculated using a similarity measure, and the k most similar
training documents are taken as its neighbors. Then, the test
document is labeled with the class by which most of its
neighbors are labeled.

To build a KNN classifier, we need to determine the
parameter k. The choice of k is an important step in building
a KNN classifier. In this study, we determine k empirically.
This method is explained in section 4.

Different similarity measures can be used in KNN clas- 
sifiers, such as Euclidean distance, Cosine similarity, etc. In 
this research, the Euclidean distance is used to measure the 
similarity between the test document and the training docu- 
ments. Euclidean distance between two documents is calcu- 
lated using the formula below [59]: 


$$s(d_i, d_j) = \sqrt{\sum_{k=1}^{|V|} (d_{ik} - d_{jk})^2} \tag{12}$$
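
The voting procedure and the distance measure can be sketched together as follows. The function names are hypothetical, documents are assumed to be dense weight vectors of equal length, and ties between classes are broken by whichever class is encountered first among the nearest neighbors, which is an arbitrary choice made for this example.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def knn_predict(train_vectors, train_labels, query, k):
    """Label `query` with the majority class among its k nearest training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
labels = ["a", "a", "b", "b", "b"]
```

Every prediction computes the distance to all training vectors, which is why KNN defers all of its work to classification time.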


4. Results

This section first briefly introduces the evaluation metrics 
used to evaluate the models. Different classifier models us- 
ing Naive Bayes and KNN classifiers and MI and CHI as 
feature selection algorithms are built and evaluated. Finally, 
a model for classifying Persian news articles is suggested, 
and the proposed model is evaluated in terms of accuracy, 
precision, recall, and f-measure. 


4.1 Evaluation Metrics

The evaluation of a document classifier is usually done ex- 
perimentally. The experimental evaluation of a classifier 
usually measures its effectiveness, which is its ability to 
make the right classification decision. There are different 
metrics for measuring the classification effectiveness,
including precision, recall, f-measure, and accuracy. In
this paper, the evaluation of classifier models is done in 
terms of these four evaluation metrics using the contingency 
table. The contingency table, as shown in Table 2, indicates 
the distribution of correctly and wrongly classified docu- 
ments. 

Table 2. Contingency Table 


Category set C = {c1, c2, ..., c|C|} | Expert judgment: Yes | Expert judgment: No
Classifier judgment: Yes | TPi | FPi
Classifier judgment: No | FNi | TNi


In the above table, TP is the number of documents that
are correctly labeled positive. TN is the number of
documents that are correctly labeled negative. FP is the
number of documents that are wrongly labeled positive, and FN
is the number of documents that are wrongly labeled negative.

To measure the values of P and R, two different methods can
be adopted: micro-averaging and macro-averaging. In
micro-averaging, P and R are calculated by summing over all
individual decisions across the categories [5]. In
macro-averaging, P and R are first calculated for each
category, and the results of the different categories are
then averaged [5]. In this paper, the macro-averaging method
is used; therefore, wherever the word average is used with
the evaluation metrics, the macro-averaging method is meant.


4.1.1 Accuracy

The accuracy measure is the ratio of correctly predicted
observations to the total number of observations. The
accuracy formula is as follows [5]:


$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{13}$$

4.1.2 Precision

The precision measure is the ratio of correctly predicted
positive observations to all observations that are predicted
as positive. The precision formula is as follows [5]:


$$P = \frac{TP}{TP + FP} \tag{14}$$


4.1.3 Recall

The recall measure is the ratio of correctly predicted positive
observations to all observations that actually belong to the
positive class. The recall formula is as follows [5]:


$$R = \frac{TP}{TP + FN} \tag{15}$$


4.1.4 F-measure
The f-measure is the harmonic mean of precision and recall.
It is calculated as follows [5]:

$$F = \frac{2RP}{R + P} \tag{16}$$
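
The per-category metrics and their macro-average can be sketched as follows. The function names are hypothetical, and returning zero when a denominator is zero is an assumption; the source does not say how such cases are handled.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and f-measure for one category, following Eqs. (14)-(16)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * recall * precision / (recall + precision)


def macro_average(per_class_counts):
    """per_class_counts: one (tp, fp, fn) tuple per category.

    Each metric is computed per category first and then averaged, so every
    category has the same influence regardless of its size.
    """
    scores = [precision_recall_f(*counts) for counts in per_class_counts]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))
```

For two categories with counts (TP = 8, FP = 2, FN = 2) and (TP = 5, FP = 5, FN = 0), the per-category precisions are 0.8 and 0.5, so the macro-averaged precision is 0.65.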
4.1.5 Evaluation
The evaluation of models has been done in 4 phases as 
shown below: 


[Figure: the four phases of the evaluation: (1) choosing between the title and the body for the training process, (2) comparing NB and KNN and comparing MI and CHI, (3) choosing the best feature space length, (4) evaluating the proposed model]


Figure 3. The Evaluation Process 


4.1.6 Choosing between the title and the body for the
training process

Persika dataset has seven columns, namely news-ID, title, 
body, date, time, category 1, and category 2. In this paper, 
we deal with the title and the body of the news articles and 
the column named category 2. To train the classifier, we can 
use the title, the body, or both of them. To see which of these
three options gave the highest accuracy, we compared them
experimentally. The results showed that using both the title
and the body of the news articles is the most effective
option.


4.1.7 Comparing NB with KNN and comparing MI with
CHI
In the second phase of our experiment, we compared NB 
with KNN classifiers to see which one would outperform 
the other in classifying Persian news articles. 

As the choice of parameter k is an essential part of
building a KNN classifier, we tried the values 1, 3, 5, 7,
and 9 for k to see which value would give the best accuracy
of the classifier. Table 3 shows the results of this
experimental analysis.

Table 3. Choice of Parameter k 


Evaluation Metric | K=1 | K=3 | K=5 | K=7 | K=9
Average Accuracy | 0.72 | 0.73 | 0.75 | 0.76 | 0.76
Average Precision | 0.72 | 0.75 | 0.76 | 0.76 | 0.77
Average Recall | 0.72 | 0.73 | 0.75 | 0.76 | 0.76
Average F-measure | 0.71 | 0.73 | 0.75 | 0.75 | 0.76


As shown in Table 3, using nine nearest neighbors 
among the training set would be the best. Therefore in the 
rest of this paper, the variable k in the KNN algorithm has a 
value of 9. 

The classifier models were built using NB and KNN 
classifiers and MI and Chi-square feature selection algo- 
rithms for the comparison purpose. 

Table 4 shows the performance of the KNN classifier 
with and without applying feature selection methods. 


Table 4. Evaluation of the KNN classifier 


Evaluation Metric | KNN.MI. | KNN.CHI. | KNN
Average Accuracy | 0.76 | 0.76 | 0.76
Average Precision | 0.77 | 0.77 | 0.77
Average Recall | 0.76 | 0.76 | 0.76
Average F-measure | 0.76 | 0.75 | 0.76


Contrary to expectations, applying MI and CHI does not 
improve the performance of the KNN classifier. 

Table 5 shows the performance of the NB classifier with 
and without applying feature selection methods. 


Table 5. Evaluation of the NB Classifier 


Evaluation Metric | NB | NB.CHI. | NB.MI.
Average Accuracy | 0.73 | 0.79 | 0.79
Average Precision | 0.64 | 0.81 | 0.81
Average Recall | 0.73 | 0.79 | 0.79
Average F-measure | 0.67 | 0.78 | 0.78


It can be seen from Table 5 that the performance of the
NB classifier improves markedly when the MI and CHI
feature selection algorithms are applied. With the MI feature
selection method, the average precision of Naive Bayes
improves by about 17 percentage points (0.64 to 0.81), its
average recall and accuracy by 6 points each (0.73 to 0.79),
and its f-measure by 11 points (0.67 to 0.78).
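
The improvements quoted above are absolute differences between the NB and NB.MI. columns of Table 5, expressed in percentage points. A minimal sketch of that arithmetic, using only the values from Table 5 (the function name is illustrative):

```python
# Table 5 values for plain NB and for NB with MI feature selection.
NB = {"accuracy": 0.73, "precision": 0.64, "recall": 0.73, "f_measure": 0.67}
NB_MI = {"accuracy": 0.79, "precision": 0.81, "recall": 0.79, "f_measure": 0.78}


def gain_in_points(before, after):
    """Absolute improvement per metric, in whole percentage points."""
    return {
        metric: round((after[metric] - before[metric]) * 100)
        for metric in before
    }


print(gain_in_points(NB, NB_MI))
# prints {'accuracy': 6, 'precision': 17, 'recall': 6, 'f_measure': 11}
```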

Table 6 shows the comparison of the KNN (in its best 
form) with the NB (in its best form). 


Table 6. Comparison of the NB with the KNN 


Evaluation Metric | NB.MI. | KNN
Average Accuracy | 0.79 | 0.76
Average Precision | 0.81 | 0.77
Average Recall | 0.79 | 0.76
Average F-measure | 0.78 | 0.76


As shown in Table 6, the NB classifier can outperform
the KNN classifier when feature selection methods are
applied, and the MI feature selection method can improve
the results.

3.1.8 Choosing the Best Feature Space Length

In the fourth phase of the evaluation process, the best
size of the feature space was determined. For this purpose,
the performance of the NB classifier using MI feature
selection was evaluated on several different subsets of the
feature space. The subsets were generated by selecting 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, and 80% of the total
features. The results are shown in Figure 4.

Figure 4. Feature space length


Based on Fig. 4, it can be seen that the performance of the 
classifier is the best when we use 8% of the feature space. 
Therefore, the final proposed model in this paper is a classi- 
fier model based on the Persika dataset using NB classifier 
and MI feature selection while selecting 8% of the features. 
We call this model PNC (Persika-based Persian News Clas- 
sifier). 
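
Selecting a percentage of the feature space can be sketched as ranking the features by their selection score and keeping the highest-scoring share. The helper below is a hypothetical illustration, not the code used for PNC: `select_top_percent` and the toy scores are assumptions, and in practice each subset would be passed to the classifier and evaluated, a step that is omitted here.

```python
def select_top_percent(scores, percent):
    """Keep the highest-scoring share of the features.

    scores: dict mapping a feature (term) to its selection score, e.g. MI
    percent: integer share of the feature space to keep, e.g. 8 for 8%
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    keep = max(1, len(scores) * percent // 100)  # always keep at least one
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Hypothetical scores for illustration only; real scores come from the
# training set.
toy_scores = {f"term{i}": float(i) for i in range(100)}

# The subset sizes evaluated in this section.
for percent in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 80):
    subset = select_top_percent(toy_scores, percent)
    print(percent, len(subset))
```

An integer percentage is used so that the subset size is computed with exact integer arithmetic rather than floating-point multiplication.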

Table 7 compares the results of the present study with
the performance of the Naive Bayes algorithm in the study
conducted by Eghbalzadeh, Hosseini, Khadivi, &
Khodabakhsh (2012) to show the improvement in the
performance of the NB classifier for Persian text
classification when feature selection algorithms are applied [45].


Table 7. Comparison of the two models for Persian text classification


Evaluation Metric | NB (Eghbalzadeh et al., 2012) | NB.MI (present study)
Accuracy (%) | 65.22 | 80


Table 8 compares the best result of the proposed model 
by Eghbalzadeh, Hosseini, Khadivi, & Khodabakhsh (2012) 
with the best result of the present study to show how the 
performance of the model is improved. 


Table 8. Comparison of the two models for Persian text classification


Evaluation Metric | KNN (K=1) | NB.MI
Accuracy (%) | 70.18 | 80


As shown in Table 8, the proposed model has an accuracy
of 80%, which is about 10 percentage points higher than
the 70.18% of the previously proposed Persika-based model
for Persian text classification.

3.1.9 Evaluating the Proposed Model 

Finally, to evaluate the proposed model, we computed the 
precision, recall, and f-measure for each class, along with 
their macro-averaged values and the average accuracy of all 
categories. 
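
The per-class and macro-averaged metrics described above can be computed from the true and predicted labels as in the following sketch. The function names and the four-document toy example are hypothetical and serve only to show the calculation; they are not data from this study.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f_measure)} for every label."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and f-measure over classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


def accuracy(y_true, y_pred):
    """Share of documents whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels, for illustration only.
y_true = ["sport", "sport", "religion", "religion"]
y_pred = ["sport", "religion", "religion", "religion"]
print(per_class_metrics(y_true, y_pred))
print(macro_average(per_class_metrics(y_true, y_pred)))
print(accuracy(y_true, y_pred))
```

Macro-averaging gives every class the same weight regardless of how many documents it contains, so a weak class lowers the average as much as a strong class raises it.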


As shown in Table 9, the proposed model can perform
well in classifying different subjects (different classes). For
instance, the model can classify news in the sports class with
an f-measure of 95% and news in the religion class with an
f-measure of 91%.

The experimental results of the final proposed model are 
shown in Tables 9 and 10 below. 


Table 9. Experimental Results of the Proposed Model 


Class | Precision | Recall | F-measure
Sports | 0.97 | 0.94 | 0.95
Religion | 0.88 | 0.94 | 0.91
Judiciary | 0.82 | 0.82 | 0.82
Culture | 0.90 | 0.75 | 0.82
Politics | 0.90 | 0.60 | 0.72
Science | 0.83 | 0.55 | 0.66
Hygiene | 0.62 | 0.97 | 0.86
Economy | 0.80 | 0.92 | 0.86
History | 0.75 | 0.81 | 0.78
Social | 0.75 | 0.40 | 0.52
Education | 0.68 | 0.97 | 0.80


It can be seen in Table 10 that the proposed model has an
average accuracy of 80% and an average f-measure of 80%,
each with a standard deviation of 0.01. The results show that
the proposed model can perform well in classifying Persian
news articles; therefore, the Persika corpus is a suitable
dataset for building a classifier model for Persian news
articles.


Table 10. Evaluation of the Proposed Model 


Evaluation Metric | Average | Standard Deviation
Accuracy | 0.80 | 0.01
Precision | 0.81 | 0.01
Recall | 0.80 | 0.01
F-measure | 0.80 | 0.01


4. Conclusion 

In this study, the main aim was to build a classifier model
based on the Persika dataset using the Naive Bayes and
K-Nearest Neighbor classifiers and to assess the performance
of these classifiers when two feature selection algorithms, MI
and CHI, are applied. The impact of feature space length on
the performance of the model was also evaluated to find the
best feature space length.

The results of the present study show that using the Naive
Bayes classifier alongside the MI feature selection method
gives the best precision, recall, f-measure, and accuracy
among the evaluated methods. We also conclude that using
8% of the feature space results in the best precision, recall,
f-measure, and accuracy. Our empirical results further show
that the proposed classifier model can automatically classify
Persian news articles with an average f-measure of 80% and
an average accuracy of 80%.